Abstract

Natural language generation of coherent long texts like paragraphs
or longer documents is a challenging problem for recurrent network
models. In this paper, we explore an important step toward this
generation task: training an LSTM (long short-term memory)
auto-encoder to preserve and reconstruct multi-sentence paragraphs.
We introduce an LSTM model that hierarchically builds an embedding
for a paragraph from embeddings for sentences and words, then
decodes this embedding to reconstruct the original paragraph. We
evaluate the reconstructed paragraph using standard metrics like
ROUGE and Entity Grid, showing that neural models are able to encode
texts in a way that preserves syntactic, semantic, and discourse
coherence. While only a first step toward generating coherent text
units from neural models, our work has the potential to
significantly impact natural language generation and
summarization.¹

1 Introduction 

Generating coherent text is a central task in natural language
processing. A wide variety of theories exist for representing
relationships between text units, such as Rhetorical Structure
Theory (Mann and Thompson, 1988) or Discourse Representation Theory
(Lascarides and Asher, 1991), for extracting these relations from
text units (Marcu, 2000; LeThanh et al., 2004; Hernault et al.,
2010; Feng and Hirst, 2012, inter alia), and for extracting other
coherence properties characterizing the role each text unit plays
with others in a discourse (Barzilay and Lapata, 2008; Barzilay and
Lee, 2004; Elsner and Charniak, 2008; Li and Hovy, 2014, inter
alia). However, applying these to text generation remains difficult.
To understand how discourse units are connected, one has to
understand the communicative function of each unit, and the role it
plays within the context that encapsulates it, recursively all the
way up for the entire text. Identifying increasingly sophisticated
human-developed features may be insufficient for capturing these
patterns. But developing neural-based alternatives has also been
difficult. Although neural representations for sentences can capture
aspects of coherent sentence structure (Ji and Eisenstein, 2014; Li
et al., 2014; Li and Hovy, 2014), it is not clear how they could
help in generating more broadly coherent text.

¹ Code for models described in this paper is available at
www.stanford.edu/~jiweil/.

Recent LSTM models (Hochreiter and Schmidhuber, 1997) have shown
powerful results on generating meaningful and grammatical sentences
in sequence generation tasks like machine translation (Sutskever et
al., 2014; 0; Luong et al., 2015) or parsing (Vinyals et al., 2014).
This performance is at least partially attributable to the ability
of these systems to capture local compositionality: the way
neighboring words are combined semantically and syntactically to
form meanings that they wish to express.

Could these models be extended to deal with generation of larger
structures like paragraphs or even entire documents? In standard
sequence-to-sequence generation tasks, an input sequence is mapped
to a vector embedding that represents the sequence, and then to an
output string of words. Multi-text generation tasks like
summarization could work in a similar way: the system reads a
collection of input sentences, and is then asked to generate
meaningful texts with certain properties (such as, for
summarization, being succinct and conclusive). Just as the local
semantic and syntactic compositionality of words can be captured by
LSTM models, can the compositionality of discourse relations of
higher-level text units (e.g., clauses, sentences, paragraphs, and
documents) be captured in a similar way, with clues about how text
units connect with one another stored in the neural compositional
matrices?

In this paper we explore a first step toward this task of neural
natural language generation. We focus on the component task of
training a paragraph (document)-to-paragraph (document) autoencoder
to reconstruct the input text sequence from a compressed vector
representation from a deep learning model. We develop hierarchical
LSTM models that arrange tokens, sentences and paragraphs in a
hierarchical structure, with different levels of LSTMs capturing
compositionality at the token-to-token and sentence-to-sentence
levels.

In the following section we offer a brief description of
sequence-to-sequence LSTM models. The proposed hierarchical LSTM
models are then described in Section 3, followed by experimental
results in Section 4, and then a brief conclusion.

2 Long-Short Term Memory (LSTM) 

In this section we give a quick overview of LSTM models. LSTM models
(Hochreiter and Schmidhuber, 1997) are defined as follows: given a
sequence of inputs $X = \{x_1, x_2, \ldots, x_{n_X}\}$, an LSTM
associates each timestep with an input, memory and output gate,
respectively denoted as $i_t$, $f_t$ and $o_t$. For notation, we
disambiguate $e$ and $h$: $e_t$ denotes the vector for an individual
text unit (e.g., word or sentence) at time step $t$, while $h_t$
denotes the vector computed by the LSTM model at time $t$ by
combining $e_t$ and $h_{t-1}$. $\sigma$ denotes the sigmoid
function. The vector representation $h_t$ for each time-step $t$ is
given by:
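
A standard LSTM formulation consistent with the gates named above
and with Equations 14 and 15 can be sketched as follows; the name
$l_t$ for the candidate vector and the exact stacking order of the
gates are assumptions here:

$$
\begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix}
=
\begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
W \cdot \begin{bmatrix} h_{t-1} \\ e_t \end{bmatrix}
$$

$$ c_t = f_t \cdot c_{t-1} + i_t \cdot l_t $$

$$ h_t = o_t \cdot c_t $$
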
where $W \in \mathbb{R}^{4K \times 2K}$. In sequence-to-sequence
generation tasks, each input $X$ is paired with a sequence of
outputs to predict: $Y = \{y_1, y_2, \ldots, y_{n_Y}\}$. An LSTM
defines a distribution over outputs and sequentially predicts tokens
using a softmax function:
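
One common way to write this distribution is given below as a
sketch; the normalization over all candidate tokens $y'$ in the
vocabulary is an assumption:

$$
P(Y \mid X)
= \prod_{t \in [1, n_Y]} p(y_t \mid x_1, \ldots, x_{n_X}, y_1, \ldots, y_{t-1})
= \prod_{t \in [1, n_Y]} \frac{\exp(f(h_{t-1}, e_{y_t}))}{\sum_{y'} \exp(f(h_{t-1}, e_{y'}))}
$$
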
$f(h_{t-1}, e_{y_t})$ denotes the activation function between
$h_{t-1}$ and $e_{y_t}$, where $h_{t-1}$ is the representation
output by the LSTM at time $t - 1$. Note that each sentence ends
with a special end-of-sentence symbol <end>. Commonly, the input and
output use two different LSTMs with different sets of parameters for
capturing different compositional patterns.

In the decoding procedure, the algorithm terminates when an <end>
token is predicted. At each timestep, either a greedy approach or
beam search can be adopted for word prediction. Greedy search
selects the token with the largest conditional probability, the
embedding of which is then combined with preceding output for next
step token prediction. For beam search, Sutskever et al. (2014)
found that a beam size of 2 suffices to provide most of the benefits
of beam search.
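
The greedy procedure above can be illustrated with a minimal sketch.
The `step` interface, the `<s>` start token and the toy bigram table
are hypothetical stand-ins for a trained LSTM decoder, not part of
the model described in this paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: step(state, prev_token) -> (probabilities, new_state).
StepFn = Callable[[object, str], Tuple[Dict[str, float], object]]


def greedy_decode(step: StepFn, state: object, start: str = "<s>",
                  end: str = "<end>", max_len: int = 50) -> List[str]:
    """Pick the most probable token at every timestep until <end> appears."""
    tokens: List[str] = []
    prev = start
    for _ in range(max_len):
        probs, state = step(state, prev)
        prev = max(probs, key=probs.get)  # token with the largest probability
        if prev == end:                   # decoding terminates on <end>
            break
        tokens.append(prev)
    return tokens


# Toy stand-in for a trained decoder: a fixed bigram table, no real LSTM state.
TOY_TABLE = {
    "<s>": {"mary": 0.9, "food": 0.1},
    "mary": {"was": 0.7, "<end>": 0.3},
    "was": {"hungry": 0.6, "mary": 0.4},
    "hungry": {"<end>": 0.8, "was": 0.2},
}


def toy_step(state, prev_token):
    """Return the next-token distribution for prev_token; state is unused."""
    return TOY_TABLE[prev_token], state


print(greedy_decode(toy_step, state=None))
```

Beam search would replace the single `max` with a ranked list of
partial hypotheses; the loop structure and stopping rule stay the
same.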

3 Paragraph Autoencoder 

In this section, we introduce our proposed hierarchical LSTM model
for the autoencoder.

3.1 Notation 

Let $D$ denote a paragraph or a document, which is comprised of a
sequence of $N_D$ sentences, $D = \{s^1, s^2, \ldots, s^{N_D},
end_D\}$. An additional "$end_D$" token is appended to each
document. Each sentence $s$ is comprised of a sequence of tokens
$s = \{w^1, w^2, \ldots, w^{N_s}\}$ where $N_s$ denotes the length
of the sentence, each sentence ending with an "$end_s$" token. The
word $w$ is associated with a $K$-dimensional embedding $e_w$,
$e_w = \{e_w^1, e_w^2, \ldots, e_w^K\}$. Let $V$ denote vocabulary
size. Each sentence $s$ is associated with a $K$-dimensional
representation $e_s$.

An autoencoder is a neural model where output units are directly
connected with or identical to input units. Typically, inputs are
compressed into a representation using neural models (encoding),
which is then used to reconstruct the inputs (decoding). For a
paragraph autoencoder, both the input $X$ and output $Y$ are the
same document $D$. The autoencoder first compresses $D$ into a
vector representation $e_D$ and then reconstructs $D$ based on
$e_D$.

For simplicity, we define $\text{LSTM}(h_{t-1}, e_t)$ to be the LSTM
operation on vectors $h_{t-1}$ and $e_t$ to obtain $h_t$ as in Equ.
1 and 2. For clarification, we first describe the following
notations used in the encoder and decoder:

• $h_t^w$ and $h_t^s$ denote hidden vectors from LSTM models, the
subscripts of which indicate timestep $t$, the superscripts of which
indicate operations at word level ($w$) or sequence level ($s$).
$h_t^w(\text{enc})$ specifies the encoding stage and
$h_t^w(\text{dec})$ specifies the decoding stage.

• $e_t^w$ and $e_t^s$ denote the word-level and sentence-level
embeddings for the word and sentence at position $t$ in terms of its
residing sentence or document.

3.2 Model 1: Standard LSTM 

The whole input and output are treated as one sequence of tokens.
Following Sutskever et al. (2014) and 0), we trained an autoencoder
that first maps input documents into vector representations from an
$\text{LSTM}_{\text{encode}}$ and then reconstructs inputs by
predicting tokens within the document sequentially from an
$\text{LSTM}_{\text{decode}}$. Two separate LSTMs are implemented
for encoding and decoding with no sentence structures considered. An
illustration is shown in Figure 1.

3.3 Model 2: Hierarchical LSTM 

The hierarchical model draws on the intuition that just as the
juxtaposition of words creates a joint meaning of a sentence, the
juxtaposition of sentences also creates a joint meaning of a
paragraph or a document.

Encoder. We first obtain representation vectors at the sentence
level by putting one layer of LSTM (denoted as
$\text{LSTM}_{\text{encode}}^{\text{word}}$) on top of its
containing words:

$$ h_t^w(\text{enc}) = \text{LSTM}_{\text{encode}}^{\text{word}}(e_t^w, h_{t-1}^w(\text{enc})) \quad (5) $$

The vector output at the ending time-step is used to represent the
entire sentence as

$$ e_s = h_{end_s}^w $$

To build representation $e_D$ for the current document/paragraph
$D$, another layer of LSTM (denoted as
$\text{LSTM}_{\text{encode}}^{\text{sentence}}$) is placed on top of
all sentences, computing representations sequentially for each
timestep:

$$ h_t^s(\text{enc}) = \text{LSTM}_{\text{encode}}^{\text{sentence}}(e_t^s, h_{t-1}^s(\text{enc})) \quad (6) $$

The representation $h_{end_D}^s$ computed at the final time step is
used to represent the entire document:

$$ e_D = h_{end_D}^s $$

Thus one LSTM operates at the token level, leading to the
acquisition of sentence-level representations that are then used as
inputs into the second LSTM that acquires document-level
representations, in a hierarchical structure.
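
The two-level encoding can be sketched in plain Python as follows.
This is a toy illustration under stated assumptions: a single LSTM
layer per level with randomly initialized weights, a tiny dimension
K = 4, gates stacked in the order input, memory, output, candidate,
and no separate $end_D$ step. The helper names (`lstm_step`,
`run_lstm`, `encode_document`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(W, h_prev, c_prev, e):
    """One LSTM update; W has 4K rows and 2K columns (assumed gate order i, f, o, l)."""
    K = len(h_prev)
    x = h_prev + e                                    # concatenate [h_{t-1}; e_t]
    z = [sum(w * v for w, v in zip(row, x)) for row in W]
    i = [sigmoid(v) for v in z[0:K]]
    f = [sigmoid(v) for v in z[K:2 * K]]
    o = [sigmoid(v) for v in z[2 * K:3 * K]]
    cand = [math.tanh(v) for v in z[3 * K:4 * K]]
    c = [f[k] * c_prev[k] + i[k] * cand[k] for k in range(K)]
    h = [o[k] * c[k] for k in range(K)]               # h_t = o_t * c_t, as in Eq. 15
    return h, c


def run_lstm(W, inputs, K):
    """Feed a sequence of K-dim vectors through the LSTM; return the last h."""
    h, c = [0.0] * K, [0.0] * K
    for e in inputs:
        h, c = lstm_step(W, h, c, e)
    return h


def encode_document(doc, emb, W_word, W_sent, K):
    """Word-level LSTM per sentence, then sentence-level LSTM over the results."""
    sentence_vectors = [run_lstm(W_word, [emb[w] for w in sent], K) for sent in doc]
    return run_lstm(W_sent, sentence_vectors, K)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.08, 0.08) for _ in range(cols)] for _ in range(rows)]


K = 4
rng = random.Random(0)
doc = [["mary", "was", "hungry", "<end_s>"],
       ["she", "didn't", "find", "any", "food", "<end_s>"]]
emb = {w: [rng.uniform(-0.08, 0.08) for _ in range(K)] for s in doc for w in s}
W_word = random_matrix(4 * K, 2 * K, rng)
W_sent = random_matrix(4 * K, 2 * K, rng)
e_D = encode_document(doc, emb, W_word, W_sent, K)
print(len(e_D))
```

The sentence-level LSTM never sees individual words; it only
receives the final hidden vector of each sentence, which is what
makes the structure hierarchical.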

Decoder. As with encoding, the decoding algorithm operates on a
hierarchical structure with two layers of LSTMs. LSTM outputs at
sentence level for time step $t$ are obtained by:

$$ h_t^s(\text{dec}) = \text{LSTM}_{\text{decode}}^{\text{sentence}}(e_t^s, h_{t-1}^s(\text{dec})) \quad (7) $$

The initial time step is $h_0^s(\text{dec}) = e_D$, the end-to-end
output from the encoding procedure. $h_t^s(\text{dec})$ is used as
the original input into $\text{LSTM}_{\text{decode}}^{\text{word}}$
for subsequently predicting tokens within sentence $t + 1$.
$\text{LSTM}_{\text{decode}}^{\text{word}}$ predicts tokens at each
position sequentially, the embedding of which is then combined with
earlier hidden vectors for the next time-step prediction until the
$end_s$ token is predicted. The procedure can be summarized as
follows:

$$ h_t^w(\text{dec}) = \text{LSTM}_{\text{decode}}^{\text{word}}(e_t^w, h_{t-1}^w(\text{dec})) \quad (8) $$

$$ p(w \mid \cdot) = \text{softmax}(e_w, h_{t-1}^w(\text{dec})) \quad (9) $$

During decoding, $\text{LSTM}_{\text{decode}}^{\text{word}}$
generates each word token $w$ sequentially and combines it with
earlier LSTM-output hidden vectors. The LSTM hidden vector computed
at the final time step is used to represent the current sentence.

This is passed to $\text{LSTM}_{\text{decode}}^{\text{sentence}}$,
combined with $h_t^s$ for the acquisition of $h_{t+1}$, and output
to the next time step in sentence decoding.

For each timestep $t$,
$\text{LSTM}_{\text{decode}}^{\text{sentence}}$ has to first decide
whether decoding should proceed or come to a full stop: we add an
additional token $end_D$ to the vocabulary. Decoding terminates when
token $end_D$ is predicted. Details are shown in Figure 2.

3.4 Model 3: Hierarchical LSTM with Attention

Attention models adopt a look-back strategy by linking the current
decoding stage with input sentences in an attempt to consider which
part of the input is most responsible for the current decoding
state. This attention version of the hierarchical model is inspired
by similar work in image caption generation and machine translation
(Xu et al., 2015; 0).

Let $H = \{h_1^s(\text{enc}), h_2^s(\text{enc}), \ldots,
h_N^s(\text{enc})\}$ be the collection of sentence-level hidden
vectors for each sentence from the inputs, output by
$\text{LSTM}_{\text{encode}}^{\text{sentence}}$. Each element in $H$
contains information about input sequences with a strong focus on
the parts surrounding each specific sentence (time-step). During
decoding, suppose that $e_t^s$ denotes the sentence-level embedding
at the current step and that $h_{t-1}^s(\text{dec})$ denotes the
hidden vector output by
$\text{LSTM}_{\text{decode}}^{\text{sentence}}$ at the previous time
step $t - 1$. Attention models would first link the current-step
decoding information, i.e., $h_{t-1}^s(\text{dec})$, with each of
the input sentences $i \in [1, N]$, characterized by a strength
indicator $v_i$:

$$ v_i = U^T f(W_1 \cdot h_{t-1}^s(\text{dec}) + W_2 \cdot h_i^s(\text{enc})) \quad (10) $$

The attention vector is then created by averaging weights over all
input sentences:

$$ m_t = \sum_{i \in [1, N_D]} a_i h_i^s(\text{enc}) \quad (12) $$

LSTM hidden vectors for the current step are then obtained by
combining $m_t$, $e_t^s$ and $h_{t-1}^s(\text{dec})$:

$$ c_t = f_t \cdot c_{t-1} + i_t \cdot l_t \quad (14) $$

$$ h_t^s = o_t \cdot c_t \quad (15) $$

where $W \in \mathbb{R}^{4K \times 3K}$. $h_t^s$ is then used for
word prediction as in the vanilla version of the hierarchical model.

Figure 1: Standard Sequence to Sequence Model.

Figure 2: Hierarchical Sequence to Sequence Model.

Figure 3: Hierarchical Sequence to Sequence Model with Attention.

Table 1: Statistics for the Datasets. W, S and D respectively
represent number of words, number of sentences, and number of
documents/paragraphs. For example, "S per D" denotes average number
of sentences per document.

dataset         S per D    W per D    W per S
Hotel-Review    8.8        124.8      14.1
Wikipedia       8.4        132.9      14.8
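
The attention step can be sketched as below. Two points are
assumptions rather than statements from the text: the function $f$
in Equation 10 is taken to be tanh, and the weights $a_i$ are taken
to be a softmax normalization of the scores $v_i$. The function and
variable names are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def attention(h_dec_prev, H_enc, W1, W2, U):
    """Return (weights a, attention vector m) for one sentence-level decoding step."""
    proj_dec = matvec(W1, h_dec_prev)
    scores = []
    for h_i in H_enc:
        proj_enc = matvec(W2, h_i)
        # v_i = U^T f(W1 h_dec + W2 h_i), with f assumed to be tanh
        scores.append(sum(u * math.tanh(a + b)
                          for u, a, b in zip(U, proj_dec, proj_enc)))
    # Assumed normalization: softmax over the scores (shifted for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    a = [x / total for x in exps]
    # m_t = sum_i a_i * h_i(enc), as in Eq. 12
    K = len(H_enc[0])
    m = [sum(a_i * h_i[k] for a_i, h_i in zip(a, H_enc)) for k in range(K)]
    return a, m


identity = [[1.0, 0.0], [0.0, 1.0]]
H_enc = [[1.0, 0.0], [0.0, 1.0]]          # two encoded input sentences, K = 2
weights, m_t = attention([1.0, 0.0], H_enc, identity, identity, U=[1.0, 1.0])
print(weights, m_t)
```

The resulting vector $m_t$ would then be concatenated with $e_t^s$
and $h_{t-1}^s(\text{dec})$ as the input to the sentence-level
decoder, which is why $W$ has $3K$ columns in this variant.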


3.5 Training and Testing 

Parameters are estimated by maximizing likelihood of outputs given
inputs, similar to standard sequence-to-sequence models. A softmax
function is adopted for predicting each token within output
documents, the error of which is first back-propagated through
$\text{LSTM}_{\text{decode}}^{\text{word}}$ to sentences, then
through $\text{LSTM}_{\text{decode}}^{\text{sentence}}$ to document
representation $e_D$, and last through
$\text{LSTM}_{\text{encode}}^{\text{sentence}}$ and
$\text{LSTM}_{\text{encode}}^{\text{word}}$ to inputs. Stochastic
gradient descent with minibatches is adopted.

For testing, we adopt a greedy strategy with no beam search. For a
given document $D$, $e_D$ is first obtained given already learned
$\text{LSTM}_{\text{encode}}$ parameters and word embeddings. Then
in decoding, $\text{LSTM}_{\text{decode}}^{\text{sentence}}$
computes embeddings at each sentence-level time-step, which are
first fed into the binary classifier to decide whether sentence
decoding terminates and then into
$\text{LSTM}_{\text{decode}}^{\text{word}}$ for word decoding.


4 Experiments 

4.1 Dataset 

We implement the proposed autoencoder on two datasets, a highly
domain-specific dataset consisting of hotel reviews and a general
dataset extracted from Wikipedia.

Hotel Reviews. We use a subset of hotel reviews crawled from
TripAdvisor. We consider only reviews ranging from 50 to 250 words
in length; the model has problems dealing with extremely long
sentences, as we will discuss later. We keep a vocabulary set
consisting of the 25,000 most frequent words. A special "<unk>"
token is used to denote all the remaining less frequent tokens.
Reviews that consist of more than 2 percent of unknown words are
discarded. Our training dataset is comprised of roughly 340,000
reviews; the testing set is comprised of 40,000 reviews. Dataset
details are shown in Table 1.

Wikipedia. We extracted paragraphs from the Wikipedia corpus that
meet the aforementioned length requirements. We keep a vocabulary
list of the 120,000 most frequent words. Paragraphs with more than 4
percent of unknown words are discarded. The training dataset is
comprised of roughly 500,000 paragraphs and the testing set contains
roughly 50,000.
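
The filtering steps described for both datasets can be sketched as
follows. The function names and the toy documents are hypothetical;
the real thresholds (25,000 words and 2 percent for hotel reviews,
120,000 words and 4 percent for Wikipedia, 50 to 250 words per
document) would be passed in as arguments.

```python
from collections import Counter

UNK = "<unk>"


def build_vocab(docs, size):
    """Keep the `size` most frequent tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok for tok, _ in counts.most_common(size)}


def preprocess(docs, vocab, max_unk_ratio, min_len=50, max_len=250):
    """Apply the length filter, map rare tokens to <unk>, drop noisy documents."""
    kept = []
    for doc in docs:
        if not (min_len <= len(doc) <= max_len):
            continue
        mapped = [tok if tok in vocab else UNK for tok in doc]
        if mapped.count(UNK) / len(mapped) > max_unk_ratio:
            continue
        kept.append(mapped)
    return kept


# Toy data with deliberately tiny limits so the effect is visible.
toy_docs = [["a", "b", "a"], ["a", "c", "a", "a"], ["d", "e"]]
toy_vocab = build_vocab(toy_docs, size=1)
print(preprocess(toy_docs, toy_vocab, max_unk_ratio=0.34, min_len=1, max_len=10))
```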

4.2 Training Details and Implementation 

Previous research has shown that deep LSTMs work better than shallow
ones for sequence-to-sequence tasks (Vinyals et al., 2014; Sutskever
et al., 2014). We adopt an LSTM structure with four layers for
encoding and four layers for decoding, each of which is comprised of
a different set of parameters. Each LSTM layer consists of 1,000
hidden neurons and the dimensionality of word embeddings is set to
1,000. Other training details are given below, some of which follow
Sutskever et al. (2014).



• Input documents are reversed. 

• LSTM parameters and word embeddings are initialized from a uniform
distribution between [-0.08, 0.08].

• Stochastic gradient descent is implemented without momentum using
a fixed learning rate of 0.1. We started halving the learning rate
every half epoch after 5 epochs. We trained our models for a total
of 7 epochs.

• Batch size is set to 32 (32 documents). 

• Decoding algorithm allows generating at 
most 1.5 times the number of words in inputs. 

• 0.2 dropout rate. 

• Gradient clipping is adopted by scaling gradients when the norm
exceeded a threshold of 5.
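
Two of the settings in the list above, the learning-rate schedule
and gradient clipping, can be sketched as follows. The sketch
assumes that the first halving happens exactly at epoch 5 and that
clipping uses the global L2 norm over all gradients; the list does
not specify either detail, and the function names are hypothetical.

```python
import math


def learning_rate(epochs_done, base=0.1, start=5.0, every=0.5):
    """Fixed rate until `start` epochs, then halved once per `every` epochs."""
    if epochs_done < start:
        return base
    # Assumption: the first halving is applied at `start` itself.
    halvings = int((epochs_done - start) // every) + 1
    return base * 0.5 ** halvings


def clip_gradients(grads, threshold=5.0):
    """Rescale all gradients together when their global L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= threshold:
        return grads
    scale = threshold / norm
    return [[g * scale for g in vec] for vec in grads]


print(learning_rate(4.0), learning_rate(5.0), learning_rate(6.5))
print(clip_gradients([[3.0, 4.0], [0.0, 12.0]]))
```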

Our implementation on a single GPU² processes approximately
600-1,200 tokens per second. We trained our models for a total of 7
iterations.

4.3 Evaluations 

We need to measure the closeness of the output 
(candidate) to the input (reference). We first adopt 
two standard evaluation metrics, ROUGE (Lin, 
2004; Lin and Hovy, 2003) and BLEU (Papineni 
et al., 2002). 

ROUGE is a recall-oriented measure widely used in the summarization
literature. It measures the n-gram recall between the candidate text
and the reference text(s). In this work, we only have one reference
document (the input document) and the ROUGE score is therefore given
by:

$$ \text{ROUGE}_n = \frac{\sum_{gram_n \in input} count_{match}(gram_n)}{\sum_{gram_n \in input} count(gram_n)} \quad (16) $$

where $count_{match}$ denotes the number of n-grams co-occurring in
the input and output. We report ROUGE-1, 2 and W (based on weighted
longest common subsequence).
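
Equation 16 can be computed with a few lines of Python. This is a
minimal sketch for a single reference; clipping each match by the
candidate's own n-gram count is an assumption about how
$count_{match}$ is defined, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    """n-gram recall of the candidate against a single reference (Eq. 16)."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Assumed: matches are clipped by the candidate's own n-gram counts.
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


reference = "mary was hungry".split()
candidate = "mary was very hungry".split()
print(rouge_n(candidate, reference, 1), rouge_n(candidate, reference, 2))
```

In this toy case every reference unigram is recovered, but only one
of the two reference bigrams survives the inserted word.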

² Tesla K40m, 1 Kepler GK110B, 2880 Cuda cores.

BLEU. Purely measuring recall will inappropriately reward long
outputs. BLEU is designed to address such an issue by emphasizing
precision. n-gram precision scores for our situation are given by:

$$ \text{precision} = \frac{\sum_{gram_n \in output} count_{match}(gram_n)}{\sum_{gram_n \in output} count(gram_n)} \quad (17) $$

BLEU then combines the average logarithm of precision scores with a
length-based penalization. For details, see Papineni et al. (2002).
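
The n-gram precision of Equation 17 mirrors the recall computation
with the roles of input and output swapped. The sketch below covers
only this precision term, not the full BLEU score with its length
penalization; clipping matches by the reference counts is an
assumption, and the function names are hypothetical.

```python
from collections import Counter


def ngram_multiset(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (Eq. 17)."""
    cand = ngram_multiset(candidate, n)
    ref = ngram_multiset(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Assumed: each candidate n-gram is credited at most as often as it
    # appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


ref_tokens = "mary was hungry".split()
out_tokens = "mary was very hungry".split()
print(ngram_precision(out_tokens, ref_tokens, 1))
print(ngram_precision(out_tokens, ref_tokens, 2))
```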

Input-Wiki

Washington was unanimously elected President by the electors in both the 1788 - 1789 and
1792 elections . he oversaw the creation of a strong, well-financed national government that
maintained neutrality in the french revolutionary wars , suppressed the whiskey rebellion , and
won acceptance among Americans of all types . Washington established many forms in government
still used today , such as the cabinet system and inaugural address . his retirement after
two terms and the peaceful transition from his presidency to that of john adams established a
tradition that continued up until franklin d . roosevelt was elected to a third term . Washington
has been widely hailed as the ” father of his country ” even during his lifetime.

Output-Wiki

Washington was elected as president in 1792 and voters <unk> of these two elections until
1789 . he continued suppression <unk> whiskey rebellion of the french revolution war government
, strong , national well are involved in the establishment of the fin advanced operations
, won acceptance . as in the government, such as the establishment of various forms of inauguration
speech Washington , and are still in use . <unk> continued after the two terms of his
quiet transition to retirement of <unk> <unk> of tradition to have been elected to the third
paragraph . but , ” the united nations of the father ” and in Washington in his life , has been
widely praised .

Input-Wiki

apple inc . is an american multinational corporation headquartered in cupertino , California ,
that designs , develops , and sells consumer electronics , computer software , online services ,
and personal com - puters . its bestknown hardware products are the mac line of computers , the
ipod media player . the iphone smartphone , and the ipad tablet computer . its online services
include icloud , the itunes store , and the app store . apple’s consumer software includes the os
x and ios operating systems , the itunes media browser , the safari web browser , and the ilife
and iwork creativity and productivity suites .

Output-Wiki

apple is a us company in California , <unk> , to develop electronics , softwares , and pc , sells
. hardware include the mac series of computers , ipod . iphone . its online services , including
icloud , itunes store and in app store . softwares , including os x and ios operating system ,
itunes , web browser , <unk> , including a productivity suite .

Input-Wiki

paris is the capital and most populous city of france . situated on the seine river , in the north of
the country , it is in the centre of the le-de-france region . the city of paris has a population of
2273305 inhabitants . this makes it the fifth largest city in the european union measured by the
population within the city limits .

Output-Wiki

paris is the capital and most populated city in france . located in the <unk> , in the north of the
country , it is the center of <unk> . paris , the city has a population of <num> inhabitants .
this makes the eu ’ s population within the city limits of the fifth largest city in the measurement

Input-Review

on every visit to nyc , the hotel beacon is the place we love to stay . so conveniently located
to central park , lincoln center and great local restaurants . the rooms are lovely . beds so
comfortable , a great little kitchen and new wizz bang coffee maker . the staff are so accommodating
and just love walking across the street to the fairway supermarket with every imaginable
goodies to eat.

Output-Review

every time in new york , lighthouse hotel is our favorite place to stay . very convenient, central
park , lincoln center , and great restaurants . the room is wonderful , very comfortable bed , a
kitchenette and a large explosion of coffee maker . the staff is so inclusive , just across the street
to walk to the supermarket channel love with all kinds of what to eat.

Table 2: A few examples produced by the hierarchical LSTM alongside the inputs.

Coherence Evaluation. Neither BLEU nor ROUGE attempts to evaluate
true coherence. There is no generally accepted and readily available
coherence evaluation metric.³ Because of the difficulty of
developing a universal coherence evaluation metric, we propose here
only a tailored metric specific to our case. Based on the assumption
that human-generated texts (i.e., input documents in our tasks) are
coherent (Barzilay and Lapata, 2008), we compare generated outputs
with input documents in terms of how much original text order is
preserved.

³ Wolf and Gibson (2005) and Lin et al. (2011) proposed metrics
based on discourse relations, but these are hard to apply widely
since identifying discourse relations is a difficult problem. Indeed
sophisticated coherence evaluation metrics are seldom adopted in
real-world applications, and summarization researchers tend to use
simple approximations like number of overlapped tokens or topic
distribution similarity (e.g., Yan et al., 2011b; Yan et al., 2011a;
Celikyilmaz and Hakkani-Tur, 2011).

We develop a grid evaluation metric similar to the entity transition
algorithms in (Barzilay and Lee, 2004; Lapata and Barzilay, 2005).
The key idea of Barzilay and Lapata’s models is to first identify
grammatical roles (i.e., object and subject) that entities play and
then model the transition probability over entities and roles across
sentences. We represent each sentence as a feature-vector consisting
of verbs and nouns in the sentence. Next we align sentences from
output documents to input sentences based on sentence-to-sentence F1
scores (precision and recall are computed similarly to ROUGE and
BLEU but at sentence level) using feature vectors. Note that
multiple output sentences can be matched to one input sentence.
Assume that sentence $s_{output}^{i}$ is aligned with sentence
$s_{input}^{i'}$, where $i$ and $i'$ denote the position index for
an output sentence and its aligned input. The penalization score $L$
is then given by:

$$ L = \frac{2}{N_{output} \cdot (N_{output} - 1)} \times \sum_{i \in [1, N_{output} - 1]} \sum_{j \in [i + 1, N_{output}]} |(j - i) - (j' - i')| \quad (18) $$

Equ. 18 can be interpreted as follows: $(j - i)$ denotes the
distance in terms of position index between two output sentences
indexed by $j$ and $i$, and $(j' - i')$ denotes the distance between
their mirrors in inputs. As we wish to penalize the degree of
permutation in terms of text order, we penalize the absolute
difference between the two computed distances. This metric is also
relevant to the overall performance of prediction and recall: an
irrelevant output will be aligned to a random input, thus being
heavily penalized. The deficiency of the proposed metric is that it
concerns itself only with a semantic perspective on coherence,
barely considering syntactical issues.
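
Given an alignment from output sentences to input sentences, the
penalization score of Equ. 18 reduces to an average over sentence
pairs. The sketch below assumes the alignment has already been
computed (the F1-based matching step is not shown) and uses 0-based
positions, which leaves the differences unchanged; the function name
is hypothetical.

```python
def coherence_penalty(aligned):
    """Average order distortion over all output sentence pairs (Equ. 18).

    aligned[i] is the position of the input sentence matched to output
    sentence i; several outputs may map to the same input.
    """
    n = len(aligned)
    if n < 2:
        return 0.0
    total = sum(abs((j - i) - (aligned[j] - aligned[i]))
                for i in range(n - 1)
                for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))


print(coherence_penalty([0, 1, 2]))   # order fully preserved
print(coherence_penalty([2, 1, 0]))   # order fully reversed
```

A perfectly preserved order yields 0, and larger values indicate
that pairs of sentences have drifted further from their original
relative positions.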

4.4 Results 

A summary of our experimental results is given in Table 3. We
observe better performance on the hotel-review dataset than on the
open-domain Wikipedia dataset, for the intuitive reason that
documents and sentences are written in a more fixed format and are
easier to predict for hotel reviews.

The hierarchical model that considers sentence-level structure
outperforms standard sequence-to-sequence models. Attention models
at the sentence level introduce a performance boost over vanilla
hierarchical models.

With respect to the coherence evaluation, the original sentence
order is mostly preserved: the hierarchical model with attention
achieves L = 1.57 on the hotel-review dataset, meaning that the
relative position of two input sentences is permuted by an average
degree of 1.57. Even for the Wikipedia dataset where more
poor-quality sentences are observed, the original text order can
still be adequately maintained with L = 2.04.

Model                     Dataset        BLEU    ROUGE-1   ROUGE-2   Coherence (L)
Standard                  Hotel Review   0.241   0.571     0.302     1.92
Hierarchical              Hotel Review   0.267   0.590     0.330     1.71
Hierarchical+Attention    Hotel Review   0.285   0.624     0.355     1.57
Standard                  Wikipedia      0.178   0.502     0.228     2.75
Hierarchical              Wikipedia      0.202   0.529     0.250     2.30
Hierarchical+Attention    Wikipedia      0.220   0.544     0.291     2.04

Table 3: Results for three models on two datasets. As with coherence
score L, smaller values signify better performance.

5 Discussion and Future Work 

In this paper, we extended recent sequence-to-sequence LSTM models
to the task of multi-sentence generation. We trained an autoencoder
to see how well LSTM models can reconstruct input documents of many
sentences. We find that the proposed hierarchical LSTM models can
partially preserve the semantic and syntactic integrity of
multi-text units and generate meaningful and grammatical sentences
in coherent order. Our model performs better than standard
sequence-to-sequence models which do not consider the intrinsic
hierarchical discourse structure of texts.

While our work on auto-encoding for larger texts is only a
preliminary effort toward allowing neural models to deal with
discourse, it nonetheless suggests that neural models are capable of
encoding complex clues about how coherent texts are connected.

The performance on this autoencoder task could certainly also
benefit from more sophisticated neural models. For example, one
extension might align the sentence currently being generated with
the original input sentence (similar to sequence-to-sequence
translation in (0)), and later transform the original task to
sentence-to-sentence generation. However, our long-term goal here is
not to perfect this basic multi-text generation scenario of
reconstructing input documents, but rather to extend it to more
important applications.

That is, the autoencoder described in this work, where input
sequence X is identical to output Y, is only the most basic instance
of the family of document (paragraph)-to-document (paragraph)
generation tasks. We hope the ideas proposed in this paper can play
some role in enabling such more sophisticated generation tasks like
summarization, where the inputs are original documents and outputs
are summaries, or question answering, where inputs are questions and
outputs are the actual wording of answers. Sophisticated generation
tasks like summarization or dialogue systems could extend this
paradigm, and could themselves benefit from task-specific
adaptations. In summarization, sentences to generate at each
timestep might be pre-pointed to or pre-aligned to specific aspects,
topics, or pieces of texts to be summarized. Dialogue systems could
incorporate information about the user or the time course of the
dialogue. In any case, we look forward to more sophisticated
applications of neural models to the important task of natural
language generation.